1. New to Jupyter notebooks?
- This is a Jupyter notebook file (the file extension is .ipynb for a Python notebook versus .py for a standard Python script)
- It is a format used by many data scientists and researchers for analysis and visualization tasks
- Useful because (a) it is easier to work with and understand than a single Python script file (.py) and
- (b) it lets you break a problem down into steps and quickly see the results in the same window
- Key point: for data analysis, a Jupyter notebook is a clear and efficient way to conduct analysis and communicate the findings
Two types of Jupyter notebook cells:
(1) Markdown cells (Cell->Cell Type->Markdown):
- Descriptions
- text (this entire cell that you are looking at is a Markdown cell where text exists)
- images
- Key point: Markdown cells are a place for notes and descriptions
(2) Code cells (Cell->Cell Type->Code): actual code
- A code cell stores and runs your code
- Any non-code text must be formatted as a comment, i.e., begin with "#"
- Key point: code cells are the workhorse; they execute your program and output results
Running Jupyter notebook cells
(1) To run every line of code in the entire notebook
- Cell->Run All
(2) To run only a single code cell
- Click on the cell and hit Shift-Enter
In [2]:
# This is a code cell
# This is the place where your code is written
# Any line beginning with "#" is not code but a description to help you and others know what is being done
# Press Shift+Enter to run this cell
In [3]:
# This is another code cell
# Here we have some code written. It is a 'function'; let's not worry about the details of functions at the moment
# It is just for seeing outputs
# In Jupyter notebooks, code shows its output below the code cell that was run
def perseverance(text):
    # join the text passed in with the rest of the saying
    return f"{text} so long as you do not stop!"

print(perseverance("It does not matter how slowly you go..."))
# below this example code cell an output will appear
In [4]:
# another example
print("this will print a message to the output")
# print("some text") messages help to (a) show the result of the code and (b) understand what went wrong
2. New to pandas?
- Pandas is a powerful data manipulation library in Python
- It is widely used in data science, machine learning, scientific computing, and many other data-intensive fields
- Useful because (a) it provides flexible data structures for efficient manipulation of structured data and
- (b) it has rich functionality for data cleaning, transformation, and analysis
- Key point: for data analysis, pandas provides a high-performance, easy-to-use data structure (DataFrame) and data analysis tools.
Two main data structures in pandas:
(1) Series:
- A one-dimensional labeled array capable of holding any data type
- It is similar to a column in a spreadsheet, a field in a database, or a single column vector of a matrix
- Key point: Series is the primary building block of pandas; each column of a DataFrame is a Series
(2) DataFrame:
- A two-dimensional labeled data structure with columns potentially of different types.
- It is similar to a spreadsheet or SQL table
- Key point: DataFrame is the primary pandas data structure for data manipulation and analysis
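A minimal sketch of how the two structures relate, assuming pandas is installed. The names `wind_speeds` and `storms` and all values are made up for illustration; they are not taken from the tornado dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
wind_speeds = pd.Series([65, 110, 140], index=["a", "b", "c"], name="wind_mph")

# A DataFrame holds several labeled columns, which may have different types
storms = pd.DataFrame(
    {
        "state": ["TX", "OK", "KS"],   # text column
        "wind_mph": [65, 110, 140],    # integer column
    },
    index=["a", "b", "c"],
)

# Selecting a single column of a DataFrame gives back a Series
column = storms["wind_mph"]
print(type(column).__name__)
print(storms.shape)  # (number of rows, number of columns)
```

Selecting one column is a quick way to see that a DataFrame is, in effect, a collection of Series that share the same row labels.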
Working with pandas
(1) To import the pandas library
import pandas as pd
(2) To create a DataFrame
df = pd.DataFrame(data)
(3) To read a CSV file into a DataFrame
df = pd.read_csv('file.csv')
(4) To get the first 5 rows of the DataFrame
df.head()
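The four steps above can be sketched end to end as follows. This is only an illustration: the `data` dictionary and the CSV text are made-up stand-ins, and `io.StringIO` is used in place of a real 'file.csv' so the example runs without any file on disk.

```python
import io
import pandas as pd

# (2) Hypothetical data: a dict mapping column names to lists of values
data = {"year": [1950, 1951, 1952], "n_events": [3, 5, 2]}
df = pd.DataFrame(data)

# (3) Stand-in for a CSV file: read_csv also accepts file-like objects
csv_text = "year,n_events\n1950,3\n1951,5\n1952,2\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# (4) head() returns the first 5 rows by default; pass a number to change that
print(df_from_csv.head())
print(df_from_csv.head(2))
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the path to the file in quotes.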
3. Load your packages
In [5]:
# uncomment the last line of this cell if you want to see which packages you already have installed
# The '!' character is used to run a terminal command directly in the notebook
# conda is the recommended package manager
# ! conda list
In [6]:
# I recommend using conda to install itables in your terminal
# Why in the terminal and not in a code cell? Because conda asks you to confirm the installation with a yes/no answer,
# and that prompt may or may not show up in the cell output
# In the terminal type: conda install itables
# (if conda cannot find the package, try the conda-forge channel: conda install -c conda-forge itables)
In [7]:
import pandas as pd  # should already be pre-installed with conda
# To take full advantage of displaying our data as a table, let's use the itables library
from itables import init_notebook_mode, show
# After this, any pandas DataFrame or Series is displayed as an interactive table that lets you explore, filter, or sort your data
init_notebook_mode(all_interactive=True)
In [8]:
# the following code cell gives you hover highlighting on your tables when used with the itables library
In [9]:
%%html
<style>
.dataTables_wrapper tbody tr:hover {
background-color: #6495ED; /* Cornflower Blue */
}
</style>
<!-- #1E3A8A (Dark Blue) -->
<!-- #0D9488 (Teal) -->
<!-- #065F46 (Dark Green) -->
<!-- #4C1D95 (Dark Purple) -->
<!-- #991B1B (Dark Red) -->
<!-- #374151 (Dark Gray) -->
<!-- #B45309 (Deep Orange) -->
<!-- #164E63 (Dark Cyan) -->
<!-- #4A2C2A (Dark Brown) -->
<!-- #831843 (Dark Magenta) -->
<!-- Suggested Light Colors for Light Backgrounds -->
<!-- #AED9E0 (Light Blue) -->
<!-- #A7F3D0 (Light Teal) -->
<!-- #D1FAE5 (Light Green) -->
<!-- #DDD6FE (Light Purple) -->
<!-- #FECACA (Light Red) -->
<!-- #E5E7EB (Light Gray) -->
<!-- #FFEDD5 (Light Orange) -->
<!-- #B2F5EA (Light Cyan) -->
<!-- #FED7AA (Light Brown) -->
<!-- #FBCFE8 (Light Magenta) -->
4. Read in the data
In [10]:
# here we will load a historical US tornado dataset using pandas
# you can give it any name you want, as long as it follows the Python naming convention, i.e., so-called snake case (for example, tor_data)
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv")
print(tor_data.head())
# You can also show the contents of this dataset as a table by typing its name on the last line of a code cell and running it, OR by using print(data_set_name)
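Beyond head(), a few attributes give a quick overview of a freshly loaded dataset. The sketch below uses a tiny made-up CSV (the column names `yr`, `st`, and `mag` are assumptions for illustration, not necessarily the columns of the tornado file) so that it runs without the data file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for a CSV file
csv_text = """yr,st,mag
1950,MO,3
1950,IL,3
1951,TX,1
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column names as a list
print(sample.dtypes)            # data type of each column
```

The same three lines can be run on `tor_data` to check that the file was read as expected.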
In [11]:
# now see how it looks as an interactive table. You can click on the arrows by the column names to sort them...
tor_data.head()  # note that you do not have to use show() to generate an interactive table unless you want to add additional customizations
Out[11]:
In [12]:
# To read any dataset into Python you need to specify:
# 1. WHAT - the name to save the data to
# 2. WHO - which package or library is going to do the work? Here we rely on pandas, imported as 'pd'
# 3. HOW - read_csv(), the specific pandas function that will be used
# 4. WHERE - the file, given by its name (including the extension, here .csv) with a relative path (i.e., starting with '.\') or with its full path
# (1)     (2)  (3)      (4) the file name and its path are always in quotes
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv")
# Note: the 'r' before the opening quote makes this a raw string, which lets us (a) copy and paste paths directly from the file explorer and, mainly, (b) use '\' without causing an error in Python
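To see why the 'r' prefix matters, here is a small sketch comparing a normal string with a raw string. The file name `new_file.csv` is hypothetical and chosen so that the path contains the sequence `\n`; `pathlib` is shown as one possible alternative that avoids backslashes altogether.

```python
from pathlib import Path

normal = "Data\new_file.csv"   # "\n" is read as a single newline character
raw = r"Data\new_file.csv"     # the backslash and the "n" stay as two characters

print(len(normal), len(raw))         # the raw string is one character longer
print("\n" in normal, "\n" in raw)   # only the normal string contains a newline

# Alternative: build the path from its parts and let pathlib pick the separator
portable = Path("Data") / "new_file.csv"
print(portable.name)
```

A path object like `portable` can be passed to pd.read_csv() in place of a quoted string.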
In [13]:
# a good habit is to make a copy of the original data BEFORE making any changes
orginal_tor_data = tor_data.copy()  # now when we want to compare or revert to the original data we have a way to do that
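The reason for using .copy() rather than a plain assignment can be sketched with a tiny made-up DataFrame (the names `data`, `alias`, and `backup` and the values are hypothetical):

```python
import pandas as pd

data = pd.DataFrame({"mag": [1, 2, 3]})  # hypothetical values

alias = data           # a second name for the SAME DataFrame
backup = data.copy()   # an independent copy

# Change a value in the working data
data.loc[0, "mag"] = 5

print(alias.loc[0, "mag"])   # follows the change, because alias and data are one object
print(backup.loc[0, "mag"])  # keeps the original value
```

A plain `=` only gives the same DataFrame a second name, so it offers no protection against later edits; .copy() does.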
In [14]:
# use the show() function from the itables package
# notice how we can customize the table
show(tor_data.head(), caption="Tornado Data 1950-2022", options={'hover': True})  # itables also allows you to add a title or caption to your table
# caption adds a title to the table; it appears below the table but is still useful for descriptions
# options={'hover': True} adds a hover highlighting function
In [ ]:
# the column_filters="header" extension for itables is my personal favorite for exploring table data
# It adds individual column filters and removes the single default search bar (which isn't that useful)
show(tor_data, column_filters="header", layout={"topEnd": None}, maxBytes=0, options={'hover': True})